Finding document topics for improving topic segmentation

Author

  • Olivier Ferret
Abstract

Topic segmentation and topic identification are often tackled as separate problems, although both are part of topic analysis. In this article, we study how topic identification can improve a topic segmenter based on word reiteration. We first present an unsupervised method for discovering the topics of a text. We then detail how the segmenter uses these topics to find topical similarities between text segments. Finally, we show the value of the proposed method through an evaluation carried out on both French and English.
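The abstract does not describe the segmenter in detail, so the following is only a minimal sketch of what segmentation by word reiteration can look like in general: adjacent blocks of sentences are compared by the words they share, and a boundary is proposed where the overlap is low. It is not the author's method. The function names, the `window` size, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Count lowercase word tokens over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion_scores(sentences, window=2):
    """Score each gap by comparing the `window` sentences on either side of it."""
    scores = []
    for gap in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, gap - window):gap])
        right = bag_of_words(sentences[gap:gap + window])
        scores.append(cosine(left, right))
    return scores


def boundaries(sentences, window=2, threshold=0.1):
    """Return the index of each sentence that starts a new segment.

    A gap is treated as a boundary when its cohesion score is below
    `threshold` (an assumed value, not taken from the article).
    """
    scores = cohesion_scores(sentences, window)
    return [gap for gap, score in enumerate(scores, start=1) if score < threshold]


if __name__ == "__main__":
    text = [
        "the cat chased the mouse",
        "the mouse escaped the cat",
        "stock markets fell sharply",
        "markets and stock prices dropped",
    ]
    print(boundaries(text))
```

A segmenter that relies only on repeated words misses boundaries and similarities when related segments use different vocabulary, which is the kind of gap that the identified topics are meant to help close according to the abstract.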

Similar resources

A Probabilistic Topic Model Based on Local Word Relations in Overlapping Windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, given the documents and extract topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

A Dynamic Topic Model for Document Segmentation

Factor language models, like Latent Semantic Analysis, represent documents as mixtures of topics, and have a variety of applications. Normally, the mixture is computed at the whole-document level, that is, the entire document contains material on several topics, without specifying where they occur in the document. In this paper, we describe a new model which computes the topic mixture estimate ...

Improving Text Segmentation by Combining Endogenous and Exogenous Methods

Topic segmentation was addressed by a large amount of work from which it is not easy to draw conclusions, especially about the need for knowledge. In this article, we propose to combine in the same framework two methods for improving the results of a topic segmenter based on lexical reiteration. The first one is endogenous and exploits the distributional similarity of words in a document for di...

Topic Modeling in Financial Documents

This paper describes the application of topic modeling techniques to quarterly earnings call transcripts of publicly traded companies. Earnings call transcripts represent an interesting case for analysis because the document is relatively unstructured and potentially more informative than 10K and 10Q disclosures due to the question and answer session consisting of unprepared statements. This pa...

Sampling Table Configurations for the Hierarchical Poisson-Dirichlet Process

  • Discrete hierarchies are ubiquitous in intelligent systems.
  • The Poisson-Dirichlet process (PDP) [1] allows statistical inference and learning on discrete hierarchies, e.g., a hierarchy of Dirichlet distributions.
  • Applications of the PDP/HPDP include, but are not limited to:
      – Topic modeling: finding meaningful topics discussed in a large set of documents. Beneficial to automatic document analysis a...

Publication date: 2007